Building Compact N-gram Language Models Incrementally
Author
Abstract
In traditional n-gram language modeling, we collect the statistics for all n-grams observed in the training set up to a certain order. The model can then be pruned down to a more compact size with some loss in modeling accuracy. One of the more principled methods for pruning the model is the entropy-based pruning proposed by Stolcke (1998). In this paper, we present an algorithm for incrementally constructing an n-gram model. During the model construction, our method uses less memory than the pruning-based algorithms, since we never have to handle the full unpruned model. When carefully implemented, the algorithm achieves a reasonable speed. We compare our models to the entropy-pruned models in both cross-entropy and speech recognition experiments in Finnish. The entropy experiments show that neither method is optimal and that entropy-based pruning is quite sensitive to the choice of the initial model. The proposed method seems better suited to creating complex models. Nevertheless, even the small models created by our method perform on par with the best of the small entropy-pruned models in speech recognition experiments. The more complex models created by the proposed method outperform the corresponding entropy-pruned models in our experiments.
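As a rough illustration of the entropy-based criterion referenced above, the sketch below (in Python, with hypothetical function and argument names) scores each explicit n-gram by how much the training cross-entropy would be expected to rise if it were dropped in favour of its backoff estimate, and keeps only the n-grams above a threshold. Stolcke's exact criterion also accounts for the change in the history's backoff weight, which this simplification omits.

```python
import math

def prune_ngrams(ngram_logprob, backoff_logprob, history_prob, threshold):
    """Simplified entropy-based pruning in the spirit of Stolcke (1998).

    ngram_logprob:   dict (history, word) -> log P(word | history), explicit estimate
    backoff_logprob: dict (history, word) -> log P(word | shortened history)
    history_prob:    dict history -> marginal probability of the history
    threshold:       largest tolerated rise in cross-entropy per pruned n-gram
    """
    kept = {}
    for (history, word), logp in ngram_logprob.items():
        logp_bo = backoff_logprob[(history, word)]
        # Expected rise in cross-entropy if this n-gram is dropped and the
        # model backs off to the lower-order estimate instead.
        delta = history_prob[history] * math.exp(logp) * (logp - logp_bo)
        if delta > threshold:
            kept[(history, word)] = logp
    return kept

# Toy usage with made-up numbers: the explicit estimate is close to its
# backoff estimate, so the entry is pruned at this threshold.
kept = prune_ngrams(
    ngram_logprob={(("in", "the"), "city"): math.log(0.20)},
    backoff_logprob={(("in", "the"), "city"): math.log(0.18)},
    history_prob={("in", "the"): 0.01},
    threshold=1e-3,
)
print(kept)  # -> {}
```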
Related papers
Faster and Smaller N-Gram Language Models
N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google...
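The kind of compactness described above can be pictured with a minimal sketch: instead of a pointer-heavy hash map, n-gram keys are held in one sorted array and probed by binary search. This is only an assumption-laden illustration; the implementations in the cited paper rely on trie encodings, quantization, and integer word IDs rather than Python tuples.

```python
import bisect

class SortedNgramTable:
    """Minimal sketch of a compact n-gram store: keys live in one sorted list
    and are probed with binary search rather than a pointer-heavy hash map."""

    def __init__(self, entries):
        # entries: iterable of (ngram_tuple, value), stored sorted for bisection
        items = sorted(entries)
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def lookup(self, ngram):
        i = bisect.bisect_left(self._keys, ngram)
        if i < len(self._keys) and self._keys[i] == ngram:
            return self._values[i]
        return None

# Example: store two bigram counts and query one of them.
table = SortedNgramTable([(("the", "cat"), 12), (("the", "dog"), 7)])
print(table.lookup(("the", "cat")))  # -> 12
```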
Compact n-gram models by incremental growing and clustering of histories
This work concerns building n-gram language models that are suitable for large vocabulary speech recognition in devices that have a restricted amount of memory and space available. Our target language is Finnish, and in order to evade the problems of its rich morphology, we use sub-word units, morphs, as model units instead of the words. In the proposed model we apply incremental growing and cl...
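One way to picture the incremental growing mentioned above, under simplifying assumptions (maximum-likelihood estimates, a fixed per-parameter cost, and no clustering of histories), is to accept a candidate higher-order n-gram only when its training-data log-likelihood gain over the lower-order estimate exceeds the cost of storing it. The function and argument names below are hypothetical, and the criteria in the cited work differ in their details.

```python
import math

def grow_order(counts, parent_counts, history_count, cost_per_ngram):
    """Sketch of one growing step: keep a candidate higher-order n-gram only
    if its training-data log-likelihood gain over the lower-order estimate
    exceeds a fixed storage cost.

    counts:         dict (history, word) -> count at the candidate (higher) order
    parent_counts:  dict (shortened_history, word) -> count at the lower order
    history_count:  dict history -> count, for histories of both lengths
    cost_per_ngram: penalty charged for every n-gram added to the model

    Maximum-likelihood estimates are used here for brevity; a real model
    would smooth both distributions.
    """
    accepted = {}
    for (history, word), c in counts.items():
        p_hi = c / history_count[history]      # estimate with the full history
        shorter = history[1:]                   # drop the oldest context word
        p_lo = parent_counts[(shorter, word)] / history_count[shorter]
        gain = c * (math.log(p_hi) - math.log(p_lo))
        if gain > cost_per_ngram:
            accepted[(history, word)] = p_hi
    return accepted
```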
Refinement of a Structured Language Model
A new language model for speech recognition inspired by linguistic analysis is presented. The model develops hidden hierarchical structure incrementally and uses it to extract meaningful information from the word history — thus enabling the use of extended distance dependencies — in an attempt to complement the locality of currently used n-gram Markov models. The model, its probabilistic parame...
Coarse-to-Fine Syntactic Machine Translation using Language Projections
The intersection of tree transducer-based translation models with n-gram language models results in huge dynamic programs for machine translation decoding. We propose a multipass, coarse-to-fine approach in which the language model complexity is incrementally introduced. In contrast to previous order-based bigram-to-trigram approaches, we focus on encoding-based methods, which use a clustered en...